Suffix-DDs: Substring Indices Based on Sequence BDDs for Constrained Sequence Mining
Authors
Abstract
In this paper, we study an efficient index structure, called Suffix Decision Diagrams (SuffixDDs), for knowledge discovery in large sequence data. Recently, Loekito, Bailey, and Pei (KAIS, 2009) proposed a new data structure for sequence data, called the Sequence Binary Decision Diagram (SeqBDD), which is an extension of Zero-suppressed Binary Decision Diagrams (ZDDs) to sequences. A SuffixDD is a compact substring index based on SeqBDDs that efficiently represents the set of all substrings of a given string. Furthermore, SuffixDDs provide a rich collection of operations on sets of sequences, inherited from ZDDs and SeqBDDs, which are useful for implementing sequence mining algorithms. We present an efficient algorithm for constructing a SuffixDD for a given text and show its correctness and complexity. We also present a set of extended operations for manipulating SuffixDDs, with application to constrained sequence mining. Finally, we give experimental results on the efficiency of SuffixDDs.
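As a rough illustration of the idea in the abstract, the following Python sketch stores a set of strings as a SeqBDD-style diagram and builds the set of all substrings of a short text. It is not the construction algorithm of the paper: it uses a naive union of prefix sets, and every name in it (`mk`, `union`, `all_substrings`, `member`, `strings`) is hypothetical. It also assumes one particular convention, namely that symbols strictly increase along 0-edges and that a node whose 1-child is the empty set is suppressed. The recursion makes it suitable for small inputs only.

```python
from functools import lru_cache

EMPTY, EPS = 0, 1        # terminal ids: the empty set and the set {""}
nodes = [None, None]     # node id -> (symbol, zero_child, one_child)
unique = {}              # (symbol, zero_child, one_child) -> node id

def mk(sym, zero, one):
    """Shared node denoting { sym + w : w in L(one) } | L(zero)."""
    if one == EMPTY:     # zero-suppression rule (assumed convention)
        return zero
    key = (sym, zero, one)
    if key not in unique:            # hash-consing keeps equal sets shared
        unique[key] = len(nodes)
        nodes.append(key)
    return unique[key]

@lru_cache(maxsize=None)
def union(a, b):
    """Set union of two diagrams, keeping symbols increasing on 0-edges."""
    if a == b or b == EMPTY:
        return a
    if a == EMPTY:
        return b
    if a == EPS:                     # make `a` the internal node
        a, b = b, a
    sa, za, oa = nodes[a]
    if b == EPS:                     # the empty string sits at the end of the 0-chain
        return mk(sa, union(za, EPS), oa)
    sb, zb, ob = nodes[b]
    if sa < sb:
        return mk(sa, union(za, b), oa)
    if sa > sb:
        return mk(sb, union(a, zb), ob)
    return mk(sa, union(za, zb), union(oa, ob))

def all_substrings(text):
    """Naive build: union of the prefix sets of every suffix of `text`."""
    prefixes = EPS                   # prefixes of the empty suffix: {""}
    result = EPS
    for ch in reversed(text):
        # prefixes of text[i:] = {""} | ch . prefixes(text[i+1:])
        prefixes = union(EPS, mk(ch, EMPTY, prefixes))
        result = union(result, prefixes)
    return result

def member(f, s):
    """Check whether string `s` belongs to the set denoted by diagram `f`."""
    for ch in s:
        while f > EPS and nodes[f][0] != ch:
            f = nodes[f][1]          # skip symbols via 0-edges
        if f <= EPS:
            return False
        f = nodes[f][2]              # consume ch via the 1-edge
    while f > EPS:                   # "" is present iff the 0-chain ends in EPS
        f = nodes[f][1]
    return f == EPS

def strings(f):
    """Enumerate the denoted set (for inspection on small inputs only)."""
    if f == EMPTY:
        return set()
    if f == EPS:
        return {""}
    sym, zero, one = nodes[f]
    return {sym + w for w in strings(one)} | strings(zero)
```

For example, `member(all_substrings("abab"), "bab")` checks whether `bab` occurs in `abab`, and `strings(...)` lists the represented set. Because `union` returns an ordinary diagram, the same machinery can combine or filter substring sets, which is the kind of set manipulation the abstract attributes to ZDD and SeqBDD operations.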
Similar resources
Building Substring Indices Using Sequence BDDs
(Abstract) There is a demand for efficient indexed-substring data structures, which can store all substrings of a given text. Suffix trees and Directed Acyclic Word Graphs (DAWGs) are examples of substring indices, but they lack operations for manipulating sets of strings. The recently proposed Sequence Binary Decision Diagram (SeqBDD) data structure is a new type of Binary Decision Diagram (BDD), and ...
Pairwise sequence alignment using bio-database compression by improved fine tuned enhanced suffix array
Sequence alignment is a bioinformatics application that determines the degree of similarity between nucleotide sequences that are assumed to share ancestral relationships. This sequence alignment method reads a query sequence from the user, aligns it against large genomic sequence data sets, and locates targets that are similar to the input query sequence. Existing accurate algor...
Ore extraction and blending optimization model in poly-metallic open pit mines by chance constrained one-sided goal programming
Determining the sequence in which ore is extracted is one of the most important problems in annual mine production scheduling. Production scheduling affects mining performance, especially in a poly-metallic open pit mine, given the operational and physical constraints imposed by the high levels of reliability required of the actual results. One of the important operational con...
Using an Extended Suffix Tree to Speed-up Sequence Alignment
An important problem in computational biology is the alignment of a given query sequence and sequences in a database to find similar (locally or globally) sequences from the database to the query. Many heuristic algorithms for this problem are based on the idea of locating a fixed-length matching pair of substrings (called a seed) to start an alignment, and then extending this alignment using d...
Repeated Record Ordering for Constrained Size Clustering
One of the main techniques used in data mining is data clustering, which has many applications in computer science, biology, and social sciences. Constrained clustering is a type of clustering in which side information provided by the user is incorporated into current clustering algorithms. One of the well researched constrained clustering algorithms is called microaggregation. In a microaggreg...
Journal title:
Volume / Issue:
Pages: -
Publication date: 2010